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Abstract 

Cytotoxic T lymphocytes (CTLs) are immune system cells that are 
thought to play an important role in controlling HIV infection. We de- 
velop a stochastic ODE model of HIV-CTL interaction that extends cur- 
rent deterministic ODE models. Based on this stochastic model, we con- 
sider the effect of CTL attack on intrahost HIV lineages assuming CTLs 
attack several epitopes with equal strength. In this setting, we intro- 
duce a limiting version of our stochastic ODE under which we show that 
the coalescence of HIV lineages can be described by a simple paintbox 
construction. Through numerical experiments, we show that our results 
under the limiting stochastic ODE accurately reflect HIV lineages under 
CTL attack when the HIV population size is on the low end of its hypoth- 
esized range. Current techniques of HIV lineage construction depend on 
the Kingman coalescent. Our results give an explicit connection between 
CTL attack and HIV lineages. 

1 Introduction 

Cytotoxic T lymphocytes (CTLs) are immune system cells that kill pathogen 
infected host cells. In the context of HIV infection, considerable exper- 
imental evidence suggests that CTLs play a central role in controlling 
infection and shaping HIV diversity, e.g. [3J 1111 1181 155] . 

Roughly, when HIV enters a host cell, typically a CD4 + cell, certain 
mechanisms within the cell cut up HIV proteins into small pieces (usually 
8—11 amino acids long) and present these peptides on the surface of the 
cell in the form a peptide-MHC complex (pMHC) [6 . CTLs can bind to 
pMHC complexes and then destroy the presenting cell, but critically each 
CTL possesses receptors that can bind to a limited pattern of peptides. 
An HIV peptide that is attacked by CTLs is referred to as an epitope. 

When CTLs attack a given epitope, HIV infected cells possessing that 
epitope are killed off. However, due to its high mutation rate, many 
variants of HIV exist during any moment of infection. As a result, infected 



cells possessing HIV variants that do not produce the attacked epitope 
may exist prior to CTL attack or arise during the attack. Such variants, 
which are at a selective advantage due to the CTL attack, will proliferate 
and come to dominate the HIV population. This hypothetical picture 
has been confirmed in many experimental HIV studies, e.g. [15] ■ Yet 
despite the putative role of CTLs in controlling HIV infection and the 
corresponding importance of HIV genetic diversity in evading CTL attack, 
the impact of CTL attack on intrahost HIV genetic diversity is not well 
understood. 

Most current theoretical tools used in HIV research do not link CTL 
models to HIV genetic diversity. On one hand, HIV-CTL interaction has 
been modeled since the beginning of the HIV epidemic (see [2B, 27 for a 
review). Various models are possible, but the standard model consists of 
a deterministic ODE composed of variables for the population size of HIV 
virions, infected and uninfected CD4 + cells, and CTLs targeting infected 
CD4 + cells. While the standard model and its many variations give a 
dynamic picture of HIV and CTL population sizes, they do not connect 
CTL attack to HIV population genetics. 

On the other hand, tools from population genetics that do not explic- 
itly model CTL attack have been applied to HIV. Rodrigo and coworkers 
used variants of the Kingman coalescent to explore the HIV life cycle and 
construct inference algorithms based on HIV genetic samples [31J 1301 [8] . 
The popular programs BEAST and LAMARC, which are used to make 
statistical inferences based on HIV genetic data, assume a Kingman coa- 
lescent [7lfT9], 

In this work, we present results that connect an ODE model of HIV 
population dynamics under CTL attack to HIV population genetics. More 
specifically, we consider a stochastic ODE that models HIV escape from 
CTL attack at multiple epitopes sometime during the chronic phase of 
infection. Our stochastic ODE describes the dynamics of the HIV pop- 
ulation in terms of discrete birth, death, and mutation events, allowing 
us to specify lineages once the dynamics are given. We show that un- 
der a certain small population limit our stochastic ODE connects to the 
deterministic ODE models described above. 

To connect to HIV population genetics, we consider a collection of 
HIV infected cells sampled after HIV has escaped CTL attack. Given 
a realization of the stochastic ODE dynamics, the lineages of these in- 
fected cells can be traced back to the time at which CTL attack initiates, 
thereby forming a genealogy. For simplicity, we assume CTL attack of 
equal strength at each considered epitope, a situation we refer to as sym- 
metric attack. In this setting, our main result characterizes the state of 
the genealogy at the time when CTL attack initiates. Further, we show 
that HIV escape mutations produce significant stochasticity in the HIV 
population dynamics. 

We analyze our stochastic ODE using methods similar to those used 
by Iwasa, Michor, Komarova, and Nowak [13] and Durrett, Schmidt, and 
Schweinsberg [S] in their study of cancer pathways. Hermisson and Pen- 
nings [121126] also used similar techniques in an abstract setting applicable 
to HIV. In all these works and our own, the dynamics of mutations present 
at low levels in the overall population are well approximated by branching 
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processes. Rouzine and Coffin considered an HIV model that bears some 
similarity to our HIV-CTL model [32) . but their analysis and goals differ 
from ours. 

Our lineage construction is similar in spirit to that of several authors, 
but there are significant differences between our underlying model and 
that of previous authors. In [141 110] the authors considered lineages from 
a population that has undergone a strong selective sweep, while in [2] the 
authors considered lineages from a population under selection-mutation 
equilibrium. Both these works considered a fixed size, Moran model with 
weak mutation rates. In our case, the stochastic ODE considered does 
not assume a fixed population size and we consider a strong mutation 
rate reflective of HIV biology. 

In section [2] we describe our model. In section [3] we describe our 
theoretical results along with associated numerical results. In section [4] 
we discuss some implications of our results. Sections [5] and [6] provide 
proofs of the results presented in Section [3] In these two sections, we 
have endeavored to focus on the intuition behind the proofs. Our hope is 
that the mathematics presented in these sections contributes to intuition 
and biological motivation. We place arguments that are mathematically 
technical, and unnecessary for intuition, in the appendix. 

2 A Model of HIV Dynamics Under CTL 
Attack 

To specify our model, in section |2.1| we introduce terminology that will 
help characterize the CTL attack. In section [2~2"l we introduce our stochas- 
tic ODE model and connect it to a deterministic ODE similar to those 
mentioned in the introduction. In section [2~3| we specify a specific param- 
eter choice for our stochastic ODE that models symmetric CTL attack. 
Finally, in section |2.4| we discuss genealogies within the context of our 
HIV-CTL model. 

2.1 Escape Graph 

We model an HIV population exposed to attack at e epitopes. To do 
this, we categorize the HIV infected cells by the presence, represented 
by a 0, or absence, represented by a 1, of a given epitope. Since there 
are e epitopes, the different HIV infected cell variants can be associated 
with a binary number of length e. For example if e = 2, then the HIV 
infected cell variant, hereafter we simply say variant, 10 represents an 
infected cell containing only the second epitope. Intuitively, we think of 
l's as representing mutations that alter a gene on the HIV genome that 
is responsible for producing the attacked epitope. 

We let £ be the set of possible variants. In this work, we focus on two 
possible choices for £. In the first, which we label as £f n ll, we consider 
every possible combination of epitopes. For example, if e = 3 we define 

£f u n = {000, 001, 010, 011, 100, 101, 110, 111}. (2.1) 
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In the second choice, which we label as fiinear, we consider only variants 
that, from left to right, contain a sequence of all l's followed by a sequence 
of all O's. In the case e = 3 we define 



flinoar = {000, 100, 110, 111} (2.2) 

Given £ we define a graph G which we call the escape graph of £ . G 
is formed from vertices labeled by elements of £ and arrows that connect 
a vertex with label v' to one with label v if a single epitope mutation can 
change v' to v. When £ = £f u n and £ — £ii noa r we refer to the associated 
G as the full and linear escape graph, respectively. Figures [T] and [2] show 
the full and linear escape graph, respectively, in the case e = 3. 




Full Escape Graph 



Figure 1: Full Escape Graph for e = 3. 



For any v £ £ , V(v) is the set of elements in £ that can be changed 
into v by transforming a single into a 1. Intuitively, we think of V(v) 
as the variants that can be transformed into v by a single mutation 
and the V stands for parents. To be clear, if v = 110 then we have 
V(v) = {010, 100} and V(v) = {100} for the full and linear escape 
graphs respectively. 

We say that the HIV population has escaped CTL attack when all 
infected cells are of type 111. . . 1. In other words, mutations that remove 
each of the attacked epitopes have fixed in the HIV population. 
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Linear Escape Graph 



Figure 2: Linear Escape Graph for e = 3. 



2.2 ODE 

Let h represent the number of uninfected CD4 + cells that are targets 
for HIV infection. For each v € £ let e v be the number of CD4 + cells 
infected by v variants. We assume birth and death rates for h and the 
e v as specified in Table [T] A and g represent the birth and death rates 
respectively of a CD4 + cell in the absence of HIV infection. b v h and k v are 
the birth and death rate of an variant, infected CD4 + cell. An infected 
CD4 + birth event corresponds to an uninfected CD4 + death event, so 
uninfected CD4 + cells have an additional death term beyond g. 

We let fi be the rate per infection event at which mutations occur that 
remove any one of the e epitopes. Correspondingly, new v variants arise 
from mutations in v' £ V(v) with a rate given in the 'mutation event' 
rate column in Table [I] Notice that a 'mutation event' in Table [I] refers 
to the creation of a v variant from a mutation in some variant contained 
in V(v), not the mutation of an variant itself. 

Define P(f{t)) to be a Poisson process with jump rate f(t) at time t. 



5 



cell type (# of cells) 


birth rate 


death rate 


mutation event 


uninfected (h) 


A 






v infected (e„) 


b v h 


k v 


A* J2v'ev(v) be v 'h 



Table 1: Birth, Death, and Mutation Rates 



Then, given the rates in Table [T] we have the following stochastic ODE, 
dh = dP(X) - dP(gh) - dP(b v ,e v ,h) (2.3) 



'ee 



de v — dP(b v e v h) — dP(k v e v ) + dP(/j,b v ie v ih). 

v'ev(v) 

where the second equation directly above applies for all v 6 £ . Each P in 



(2.3| represents an independent Poisson process run at the specified rate, 
to avoid cumbersome notation we do not use a distinct notation for each 
of these processes. In (2.3 1 and throughout this work, we ignore back 



mutations, a mutation from a variant to a less fit variant. Ignoring such 
mutations does not affect our results. 

To make our system variables (h and the e„) O(l), we rescale h and 
each e„ by EI and E respectively. Intuitively, EI and E correspond to the 
order at which uninfected but infectable CD4 + cells and infected, acti- 
vated CD4 + cells capable of producing virions exist during HIV infection, 
respectively. We set H = X/g , the steady state of uninfected cells in the 
absence of HIV infection. This scaling is supported by empirical results 
suggesting that, at least prior to AIDS onset, the number of uninfected 
CD4 + cells during and prior to HIV infection are on the same order [5]. 
Without justification for a moment, we choose E = g/b where b is on the 
order of the b v . If we rewrite h and e„ as h/M and e„/E respectively in 
(2.31, we arrive at the rescaled system, 



dh = 



dP(X) - dP(Xh) 



E dP( - x ~ 



,h) 



(2.4) 



de v 



- dP(xh ev h) 
9 \ b 



dP( 



k v g 
b 



e v )+ E dP((iX^-e v ,h) 



We would like to recover a deterministic ODE from (2.4|, in this way 
showing that our present model is an extension of current deterministic 
models. In [20], Kurtz showed that one can recover deterministic pop- 
ulation ODEs by taking large population limits of stochastic population 
ODEs. In our context, we can consider E — > oo and E — > oo. Such limits 
are reasonable for HIV due to its enormous population size, but it is not 
immediately clear what the relationship should be between H and E as 
both go to infinity. 



To address this issue in a simple context, consider (2.4| without CTL 
attack. In this setting we need not distinguish between different variants, 
reducing our system to the variables h and e, and we may also ignore 
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mutation. Taking EI and E large, we can largely ignore the stochasticity 



of (2.31 and arrive at the following deterministic ODE, 



dh 
dt 
de 
dt 



5(1 
b 
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e(Xh- 



h — eh) 
b >' 



(2.5) 



which has the equilibrium, 



h = (kg)/(b\) 



(2.6) 



Consider the variables in the expression for h directly above, k, the death 
rate of infected CD4 + cells has been measured at approximately 2 days 
|28| . If we take 2 days as our time scale, we then expect k « 1. Uninfected 
CD4 + cells last on the order 2 weeks, giving g — .1 as a reasonable choice. 
Estimates for b and A have significant variation in the literature. However, 
we can understand the role of A and b in (2.4| by noting the following 
relation, 



\b = g 2 



E 



(2.7) 



which follows from our formulas for EI and E. The above relation and 



(2.61 demonstrate that (2.4| only converges to a deterministic system in 

E 



the large population limit of 
fixed constant. 



oo if the ratio H/E converges to a 



To force (2.4 1 to have a deterministic limit, we introduce a parameter 
7 = \b/g or in terms of H, E, 7 = g(El/E) (the factor g is not essential, but 
gives the system directly below a simpler form). From a biological point 
of view, 7 is an inverse measure of the fraction of infectable cells that are 
actually infected. Empirical results for the ratio of infected to infectable 
cells are difficult as most infected CD4 + are in the lymph nodes and many 
such CD4 + are infected but inactive [55]. However, estimates in the range 
of .01 to .1 have been given by several authors and seem reasonable |22| . 
With g = .1, the corresponding range for 7 is 1 to 10. Using this scaling 
of 7, our defin ition of E is justified biologically. 

Rewriting (2.4 1 using 7 and setting b v — b v /b gives 



E Vt 



dP(-yE) - dP(jEh) 



dP(b v ,jEe v ,h) (2.8) 



dP{-yb v Ee v h) - dP(k v Ee v 



+ ^2 dP(pb v ,-yEe v ,h) 
v'ev(v) . 



and we consider the limit of this system as E — > 00. 

While 2.8) approaches a deterministic limit asE — > 00 when mutation 
is ignored, the system will continue to be stochastic if fi is sufficiently large 
with respect to E. Indeed, as we show in section [3] the scaling of fi that 
produces stochasticity is precisely a scaling in which HIV lives. Roughly, 
stochasticity of ( |2.8| l exists even as E — > 00 because the dynamics of 
variants that are of scaled population size O(^), i.e. e v = O(i), will be 
stochastic even as E — > 00. 
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However, if we ignore mutation then as E — > oo, (2.8| becomes 



dh 
At 



de v 



■g(l-h - ^2 b v >e v >h) 
= -y(b v he v — k v e v ), 



(2.9) 



which has the form of a predator-prey system. ( 2.9 1 is a simplified version 



of the standard deterministic ODE used today for HIV modeling, the full 
version includes the virion population. But generally, the reduction to 



(2.91 demonstrates how (2.81 is based on current HIV models. 



2.3 Symmetric CTL Attack and Initial Condi- 
tions 



We consider ( 2.8 1 restricted to the case of symmetric attack. To make the 
notion of symmetric attack precise, we partition the collection of variants, 
£ , into subsets £i such that 

Si = {v 6 £ : v has i l's in its binary expression} (2-10) 

For example, if e = 4, then £x = {1000, 0100, 0010, 0001} and £ x = 

{1000} for the full and linear escape graph, respectively. £; is the col- 
lection of variants that are mutated at i epitopes. We refer to the £i 
generally as variant classes and £i specifically as the ith variant class. 

To model symmetric attack, we assume that a variant v £ £, will be 
exposed to CTL attack at e — i epitopes, and we assume that the death 
rate due to CTLs at each single epitope has rate Afe. We scale time so that 
infected cells die, in the absence of CTL attack, at rate 1. As mentioned, 
the lifetime of an infected cell has been shown to be approximately 2 days 
which is in turn our unit of time. All this is made precise by taking the 
death rate k v of v £ £i to be given by 

kv = 1 + (e - i)Ajfe. (2.11) 



Finally, as an added simplification, we take b v to be constant. In (2.81 
this amounts to taking b v — 1. 

Before presenting our final system, we note that mutations do not play 
a role in the equation for h and so stochastic effects will have little impact 
on h dynamics. For simplicity, and with error that goes to as E — > oo, 
we replace the h equation by its deterministic counterpart. Putting all 
these remarks together, we arrive at the following system. 

^=g(l-h-Y,^h) (2.12) 
V v'ee J 

de v = i I dP(jEe v h) - dP(k v Ee v ) + ^ dP(priRe v >h) J . 



From this point on, we take (2.121 as describing the dynamics of the HIV 
population. 



8 



To set initial conditions, we assume that variant no = 000. . . is 
the dominant variant prior to CTL attack. Indeed, CTLs proliferate in 
response to epitopes existing in the population, so taking vq to be the 
dominant variant is biologically reasonable. Ignoring other variants for a 



moment, we set h, e VQ at time t = according to the equilibrium of (2.51 



fc(0) = -, (2.13) 

7 

e ^ u >- h (0) ■ 

Prior to CTL attack, we assume that other variants are at a slight fitness 
disadvantage to vo and arise due to mutations on wo variants. Assuming, 
as we have just done, that there are O(E) v variants, there will be O(fjE) 
variants from each of the £\ classes. From this, we can conclude that the 
number of variants in the £2 class will be of order 0(^t 2 E). As we mention 
below, /i 2 E « and so we assume that no £ t variants exist at t = for 
i > 1. For simplicity we assume e„(0) = /JE for all v € £1. 

These initial conditions are not essential to our results, other choices 
are possible. Which initial conditions are appropriate will depend on the 
period of HIV infection one has in mind. We have made a specific choice 
for the sake of clarity. 

2.4 Genealogies 

When all variants are of type 11. ■ ■ 1 , the HIV population has escaped 

e 

CTL attack. We let T samp i e be a time after such an escape has been 



completed and consider n infected cells sampled at T S am P ic- Since (2.121 
defines discrete birth and death events, we can construct lineages corre- 
sponding to the ancestral lines of these n sampled cells. 

We label the lineages £1,(2, ■ . ■ ,£ n and we let Tl(t) represent the par- 
tition structure of the lineages at time t. To explain this, consider Figure 
[3] which represents a possible lineage structure for the case n — 8. The 
values of Il(t) at t = T samp i c , T B ,T C are given by, 

U(T s&mple ) = {{£ 1 },{£ 2 },...,{£ s }} 

n(T s ) = {{4,i2},{4,4},{4,4} ! {^,4}} 

H(To) = {{£l,£2,l3,l4,£5,£ 6 },{£7,£ S }} 

At time T sa mpie all lineages are separate, n(r samp i ) consequently parti- 
tions each lineage into its own set. By time Tb, the pairs of lineages 1 and 
2, 3 and 4, 5 and 6, and 7 and 8 have coalesced. H(Tb) partitions these 
pairs to reflect this structure. Finally by time Tc, lineages 1 through 6 
have coalesced as has the pair 7 and 8. n(Tc) partitions the lineages 
accordingly. 

n(i) is a random partition function that encodes the genealogy formed 



by the n lineages. Its stochasticity follows from the stochasticity of (2.121 



as well as the stochasticity of lineages given a single realization of (2.121 



9 



Sample 1 



time 



sample 



Figure 3: Example Lineages, n = 8 



3 Results 

Our results characterize the lineage structure at t = of n infected cells 
sampled at t — T samp i c , a time after HIV has escaped CTL attack. More 
precisely, Theorems |3.1| and |3.2[ provide a random partition to which 
n(0) converges in the limit E — > oo,/x — > with the limit taken so that 
/i 3 E 2 — > and fiE — > oo. We refer to this limit as the the small population 
limit, SPL. Throughout this work, whenever we take an unspecified limit, 
we mean the SPL. 

In experimental and theoretical HIV studies, E has been estimated in 
the range 10 6 — 10 8 . In [51] . the number of activated CD4+ cells with 
integrated provirus was found to average 3 x 10 7 . Various studies have 
estimated that somewhere between 1 in 1000 to 1 in 80000 CD4 + cells are 
productively infected during HIV infection, see p. 91 in |22) and references 
therein. Using a base of 10 11 infectable lymphocytes [22], this gives a range 
of approximately 10 6 — 10 8 for E. Presumably, E varies depending on the 
individual and stage of infection. Mutation rates for HIV per base pair per 
infection event have been estimated at approximately 10~ 5 .5 . Through 
numerical experiments, we show that Theorems |3.1| and |3.2| exact in the 
SPL, are a good approximation for the lineage structure formed under 



(2.12l in the parameter regime fi — W , E = 10 . In contrast, we show 
that the parameter regime fi — 10 -5 , E = 10 8 is not well approximated 
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by the SPL. The regime fi — 10~ 5 , E = 10 7 is a middle ground in which 
the SPL is a reasonable approximation, but significant error does exist. 
Therefore, we think of the SPL as being a limiting version of ( |2,12[ | when 
HIV has a relatively small infected cell population size. 



Theorems 3.1 and 3.2 do not specify the structure of Tl(t) at times 
other than t — 0. However, the arguments we use to justify these theo- 
rems do provide some results in this direction which we mention in the 
Discussion section. Similarly, while our results focus on lineage structure, 
we make some observations regarding the stochastic dynamics of ( |2.12[ ) 
in the Discussion section. As we mentioned in section 12.21 the E — > oo 



limit does not eliminate the stochasticity of (2.121 in certain parameter 
regimes for /i and the SPL is one such a regime. 

In section [3~T| we present our SPL results, while in section |3~2] we dis- 
cuss the numerical results that connect the SPL to the parameter regimes 
of HIV. 

3.1 Small Population Limit Results 

The dynamics of ( |2.12[ | are composed of successive sweeps in which each 
variant class displaces the previous variant class as the dominant portion of 
the in fected cell population. For example, Figure [4] shows a realization of 



(2.12 l for a full escape graph with e = 5, Ak = . 1, 7 = 3, g = .1, fi = 10 
E = 10 7 . The fi gure was generated by solving ( |2.12[ ) numerically. 

Since initially there are no variants outside of the 0th and 1st variant 
classes, the variants from the ith variant class with i > 1 come from 
£i-i — s> £i mutations, i.e. a mutation v' — > v with v' £ £i-i and v G £i- 
The SPL scaling forces such mutations to occur during a time interval 
when £i-2 variants dominate the population and £i-i,£i variants are at 
low frequencies. During this time interval, all variants in £j with j < i — 2 
have been driven out of the population, or nearly so, while all variants 
in £j for j > i have yet to arise. We refer to this time interval as the 
£i-i spawning phase because the rise in £i variants is being driven by 
£%-\ — s> £i mutations. At later times, once the £i population has reached 
higher frequencies, — > £i mutations have little impact on £, variant 
population dynamics and the £i-\ spawning phase ends. The condition 
/i 3 E 2 — > in the SPL insures that a variant that is being spawned cannot 
simultaneously spawn another variant. 

During the £%—\ spawning phase, variants in £i-\ are increasing in 
population size at approximately rate Ak while variants in £ j are increas- 
ing at approximately rate 2Ak. To see why, recall that the £ j_2 variants 
dominate the infected cell population during the £i-i spawning phase. 
Since £i-i and £, variants are attacked by CTLs at 1 less and 2 less epi- 
topes than £i-2 variants, their relative fitness is given by Ak and 2Ak 
respectively. These dynamics are a generalized version of the well studied 
Luria-Delbriick (LD) model (see [36] for an excellent review of LD models 
and results). The LD model assumes a wild type population growing at 
rate, say, a that produces mutant types that also grow at rate a. This 
contrasts to the growth rates Ak, 2Ak for £i-i and £, variants respectively 
in the i — 1th spawning phase. For this reason, we refer to spawning phase 
dynamics as obeying a generalized LD model. The dynamics of the LD 
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Figure 4: Dynamics of (2.12| for a full escape graph with e = 5, Ak = .1,7 = 3, 
H = 1CT 5 , E = 10 7 . 



model have been studied for many years and the LD distribution, which 
gives the number of mutant types at a given time, is well understood. 

We analyze (2.121 by decomposing the time considered, [0, T samp ie], 
into a series of time intervals [Tj, Tj+i] for i = 0, 1, . . . , e — 2 along with 
intervals [0,T ], [T e -i,T e ] and [T e , T 8amp i c ]. The interval [T„T l+1 ] is the 
£ i+ i spawning phase. Intervals [0,T ], [T e -i,T e ], [T e ,T aamp i c ] are bound- 
ary cases that do not correspond to a spawning phase. In this way, we 



reduce (2.121 to a sequence of spawning phases. 



For each vertex v, we define the pop value of v as the number of v vari- 
ants at the beginning of the v spawning phase. More precisely, if v £ 
then the pop value of v is e„(Ti) because T,: is the beginning of v's spawn- 
ing phase. For the linear escape graph, we can express the distribution of 
the nth pop value through a simple formula that is independent of other 
pop values, see ( 5.6 1 . In the case of a full escape graph, the distribution of 



the vth pop value is given by an iterative formula that depends on other 



pop values, see (6.2| 



Pop values help us get a handle on the stochasticity of (2.121. As E 
becomes large, the stochasticity of (2.121 becomes restricted to variants 
of small population size. In our terminology, the stochasticity of (2.121 



becomes restricted to spawning phases and their corresponding general- 
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ized LD dynamics. For an interval [Ti,Ti+i], pop values describe variant 
population sizes at Ti and Ti+i, thereby giving us a handle on the LD 
dynamics that occur between these two times. 

We use extensions of previous LD results to derive our pop value for- 
mulas. However, these formulas connect to dynamics and we are interested 
in forming lineages. Correspondingly, we need to understand not only the 
dynamics of the LD model but also how to construct lineages on a gen- 
eralized LD model. The random partition Sx,i, which we discuss more 
thoroughly below, characterizes the coalescent events on lineages as they 
move backwards in time through generalized LD dynamics corresponding 
to a single spawning phase. To form lineages, we consider a sequence of 
spawning phases. For a linear escape graph, this is done through simple 
concatenation of Ha,i- But for the full escape graph, things are more 
complicated as variants within a variant class affect each others dynamics 
and hence each others lineages. 

3.1.1 linear escape graph 

As mentioned, constructing lineages for linear escape graphs is just a mat- 
ter of concatenating the coalescent events associated with each spawning 
phase. We characterize such coalescent events through the random parti- 
tion EU,i which we now define. 

For i = 0, 1, . . . , e — 2 we define a r.v. r' ! ' by, 



where Wi, W2 are independent exponential r.v's with mean 1 and B(p) is 
an independent Bernoulli r.v. with success probability p. 

We define Ea,j through a paintbox construction as follows (see |17II29| 
for a review of paintbox constructions). 

Definition 1. Let A > and i £ {0, 1, . . . , e — 2} be given. Then we 
define a partition HA,i(<S) on any set S as follows. Let K be a sample 
from a Poisson r.v. with mean A. For j = 1, 2, . . . , K we sample Fj from 
the r.v. 

Thinking of the j as colors, we 'paint' each s £ S a random color 
according to the probability 



The partition Ha,i(5) is formed by grouping together elements sharing the 
same color. 

The K samples of F' 1 ' correspond to K, £i+i — > £i+2 mutations on 
the spawning phase, [Ti, Ti+i]. Each r'^ sample is proportional to 
the number of descendants at Ti+i produced by a single such mutation. 
Essentially, the r' 1 ' samples provide a decomposition for the pop value of 
v G £i+2- More precisely, 




(3.14) 



P (paint with color j) 



T K r 



(3.15) 



K 




(3.16) 



13 



where C is a constant independent of j. The decomposition (3.161 al- 
lows us to construct lineages, while simply sampling the pop value would 
not. Roughly, the number of infected cells at Ti+i that descend from the 



jth £i+i — > £i+2 mutation is proportional, with constant C in (3.16l, to 
Tj. This means that a sampled cell will descend from mutation j with 
probability given by ( |3.15[ ). Sampled cells that descend from the same 
mutation on [Ti, Tj+i] must coalesce during that period. In this way S Aj i 
characterizes coalescent events on [Ti,Ti+i]. 

The parameter A is a tuning parameter. As Theorem |3.1| shows , raising 
A improves accuracy by considering more mutations, but at a computa- 
tional cost of increasing the number of samples that must be taken. 

For the linear escape graph, 11(0) is simply a concatenation of the Ha,;. 



Theorem 3.1. Consider (2.12) assuming a linear escape graph. Then 



letting A be any partition of the n lineages, 



limP(n(0) = A) = P(mH Aii "j (n(T sompie )) = A) + C>(i) (3.17) 

where Sa,i is given by definition^ 

Recall that n(T samp i e ) simply partitions each lineage separately since 
no coalescent events have occurred. By (nr=o (n(T , samp i e )) we mean 

the concatenation of the H A ,i applied to n(T samp i c ). For example if e = 3 
then, 



Yl E A ,i (n(T samp i c )) = H A ,o(HA,i(n(T samp i c ))). (3.18) 



3.1.2 full escape graph 

As mentioned, the i + lth spawning phase involves £i+\ — > £t+2 mutations. 
For the linear escape graph, there is only one variant in each variant class, 
meaning that there is only one type of £i+i — > £i+2 mutation. However, 
for the full escape graph we must consider v' — > v mutations for every 
v G £i+2 and v' G V(v). The time period [T^, Ti+i] will be composed of 
many concurrent spawning phases, one for each such v — > v. 

To explain the generalization of Theorem |3.1| to the full escape graph, 
recall that for the linear escape graph H A ,i is formed by taking K samples 
of and K is always sampled from a Poisson r.v. with mean A. In 
the full escape graph case, for each v G £i+2,v' G V(v) we take K v i^ fV 
samples of F*- 1 ', where K v /^ v is sampled from a Poisson r.v. with mean 
that depends on the pop value of v' relative to the other vertices in the 
£i+i class. 

To explain why K v i^ v should depend on pop values, consider v',v" G 
V(v). Suppose e v /(Ti) 3> e v ii(Ti). In other words, v' has a much higher 
pop value than v" . A higher pop value will mean that on [T^T^+i], more 
v — > v mutations occur then v" — > v mutations and correspondingly we 
should have K v /^, v > i^„»_>„. This effect did not arise in the linear escape 
graph because each variant class contains a single variant. 

Since K v >^ v depends on pop values, intuitively we must first sam- 



ple pop values and then construct the K v i^, v . However, as in (3.161, 
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to form lineages we do not sample pop values. Rather we decompose 
each pop value according to the number of descendants produced by each 
v — > v mutation. The decomposition depends on K v i^ v . Putting these 
comments together, we must build and pop value decompositions 

simultaneously. This is done in Definition [2] The variable D v is pro- 
portional to the pop value of v and is formed through a decomposition 



analogous to (3.161 



Definition 2. For each v G £\ we define D v — 1. Then we recursively 
define D v , K v i^ v and F*- 1 ' samples as follows. Suppose the D v values are 
known for v G Set 

Dmax.i-i = max D v (3.19) 

For each v 6 £i and v' £ V{v) we let K v i^ v be a sample from a Pois- 
son r.v. with mean A(D v i / .D max ,i_i) • For each j — 1, 2, . . . , K v i^ tv , we 
sample Y v i^ V j from F^' . Then, 

K v'->v 

D v = Y r v^v,j (3.20) 

v'eV(v) j=l 

The above definition allows us to define K v i^ v and F samples for every 
mutation pair v' —¥ v. The pop value of v G £i+i is given by, 

e v (Ti) = dD v = d r v->vj, (3.21) 

v'GV(v) j=l 



where d depends only on i. (3.211 is analogous to (3.16l. 

For the full escape graph, the state of our lineages is not simply a 
partition of {£i,£z, ...,£„}■ Rather, we must specify a vertex to which 
each lineage is associated at a given time t. Intuitively, the vertex associ- 
ated with, say, lj at time t is the variant type of the infected cell at time 
t from which the jth sampled cell descends. To put this in the context 
of a partition function, FI(t) partitions the lineages into disjoint sets and 
associates with each such set a vertex in £. The Ha,; defined below are 
random partitions on sets for which every element is associated with a 
vertex in £i+2- 

Definition 3. We define a partition SA,i(5) on a set S for which each 
element s g S is associated with a vertex v a 6 £i+2- For every s,v a pair 
and v' G V(v s ) let K v i^ Vs , D Vs and associated T v r^, Vs j be as defined in 
Definition^ Assign a unique color to every triple (v',v 3 ,j). Then we 
paint each element s G S the color associated with (v',v s ,j) and assign it 
element v with the following probabilities, 

r ; 

P(paint with color (v ,v s ,j) and assign vertex v ) = " — - (3.22) 

The partition Ha,i(S) is formed by grouping together elements sharing the 
same color. 

With the adjusted definition of Ha,*, the statement of Theorem |3.1| 
now holds for the full escape graph. 
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Table 2: Parameter Regimes 



Theorem 3.2. Consider ,'2.12) assuming a full escape graph. Then let- 



ting A be any partition of the n samples, 

UmP(n(0) = A) = P(||]H A ,,j (H(T sample )) = A) + 0(1) (3.23) 

where Sa,i is given by definition^ 

For the full escape graph, il(T samp i c ) partitions each lineage separately 
and assigns to each lineage the vertex 11. . . 1 since we sample after HIV 
has escape CTL attack. A should assign to each lineage a variant of class 
So or £i since these are the only variants extant at t = 0. However, for 
simplicity Theorem |3 . 2| refers to the partition structure of the lineages at 
t = 0. (If we wanted to include the variant associated with each lineage 
at t = 0, we would need a Ea,-i- Our methods allow for this, but for the 
sake of simplicity and because our initial conditions are slightly ad-hoc, 
we do not consider such an extension.) 



3.2 Numerical Results 

In this section we consider five parameter regimes: the approximating 
regime (AR), the small population regime (SPR), the medium population 
regime (MPR), and the large population regime (LPR). Table |2] specifies 
the [i and E value associated with each regime. The table also includes 
the corresponding values for /i 3 E 2 and fjK. The AR has /x 3 E 2 <C 1 while 
fiE 2> 1, suggesting a good approximation by the SPL. Notice that the 
SPR has a scaling near the SPL, but that the LPR has a /i 3 E 2 value of 
10 which, as we shall show, is too large for the SPL to apply. 

To understand the accuracy of the SPL, we first consider the probabil- 
ity that two sampled lineages coalesce. More precisely, setting n = 2 we 
consider the probability that l\ and I2 coalesce by t = or equivalently 
P(n(0) = {{^1,^2}})- This probability is often computed in population 
genetics applications and is one way to characterize n(0) [35]. Figure [5] 
shows this coalescent probability for a linear escape graph with 7 = 3 and 
AA: = .1. The x-axis gives the number of epitopes in the linear escape 
graph, our parameter e. We skip e = 1 because due to our initial con- 
ditions such an attack has a coalescent probability near zero. For each 
value of e considered, we computed five quantities given by the five bars. 
The left most bar represents the coalescent probability given in the SPL, 
as specified through Theorem |3.1| To compute the coalescent probability 



16 



in this case, we set A — 100, higher values of A don't change the result, 
and constructed Ha,i for i — 0, 1, . . . , e — 2. This amounts to sampling 
the r.v.'s r' 1 '. We generated 1000 realizations of the sequence of Ha,z. 
For each realization, we then applied the paintbox construction implied 
by the underlying r' 1 ' samples to determine if the two lineages coalesced. 
We did this 1000 times for each realization of the Ha,,. In this way we 
computed one million l's, for coalescence, and 0's, for non-coalescence. 
Averaging this list gave us the coalescent probability. 



0.9- 



0.8- 




number of epitopes attacked 



Figure 5: The probability of coalescence of two lineages for a linear escape graph 
with 7 = 3, g = .1, Ak — .1. The bars, from left to right, give the coalescent 
probability under the SPL, AR, SPR, MPR, and LPR (see Table |] for the 
definition of these parameter regimes). 



The next four bars represent, from left to right, the coalescent prob- 
ability for the AR, SPR, MPR and LPR, respectively. These values are 
computed by solving ( |2.12[ | numerically and forming lineages on top of the 
stochastic dynamics. We compute 1000 realizations of (2.121 dynamics, 
and for each such realization we consider the coalescence of 2 lineages 1000 
times. We then average over all 1000 lineage pairs and all 1000 realiza- 
tions. Solving (2.121 and building lineages on top of the dynamics is not 
numerically trivial due to the large population size. Following methods 
described in [2T], we solve ( |2.12[ ) exactly and track parent-child relation- 
ships in each variant until the variant population size exceeds 10000. At 
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that point we switch to the deterministic ODE analogue of (2.121. 

As Figure [5] demonstrates, the SPL is a good approximation in the 
AR and SPR, but not the LPR. The MPR represents a middle ground. 
Figure [6] is the same as Figure [5] except that in Figure [6] we consider a 
full escape graph. 



0.9- 



0.8- 




number of epitopes attacked 



Figure 6: The probability of coalescence of two lineages for a full escape graph 
with 7 = 3, g = .1, and Afc = .1. The bars, from left to right, give the 
coalescent probability under the SPL, AR, SPR, MPR, and LPR (see Table [2] 
for the definition of these parameter regimes). 



Another value that can be used to characterize the HIV genealogy 
shaped by CTL attack is the number of still uncoalesced lineages at t = 0. 
More precisely, we consider the number of elements in n(0). Recall that 
each element of n(0) is a collection of lineages that have coalesced. Figure 
[7|shows the distribution of this number for a full escape graph with e = 3, 
7 = 10, Afc = .3 and n = 100. The same pattern of accuracy is seen as 
with Figures [5]and [5] 

Theorems ; 1 3 . 1 1 an d 1 3 . 2 1 provide a theoretical framework for understand- 
ing genealogies on ( |2.12[ ). However, they also provide a computational 
approach for sampling such genealogies that is much faster than solving 
( 2. 12| | directly. Table|3] gives the CPU time in seconds required to generate 
the coalescent probability results shown in Figures [5] and [6] for the cases 
e = 2, 6. We show CPU times needed to produce the probability through 
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Figure 7: The number of uncoalesced lineages at t = assuming a sample of 
100 infected cells after HIV escape. Results correspond to a full escape graph 
with e = 3, 7 = 10 and Ak = .3. The bars, from left to right give values under 
the SPL, AR, SPR, MPR, and LPR (see Table [2] for the definition of these 
parameter regimes). 



our SPL results and by solving (2.121 in the SPR, the times required for 
the APR, MPR, and LPR are similar to the SPR. As can be seen, the 
SPL approach is more than 300 times faster in the case of a full escape 
graph and e = 6. For the linear escape graph and the e = 2 full escape 
graph, the SPL approach is on the order of 100 times faster. 



4 Discussion 

Application of the results we have presented depends on approximating 
the SPL scaling by satisfying fiE 3> 1 and ^t 3 E 2 -C 1. HIV almost certainly 
always satisfies fiE 3> 1, so this condition is not restrictive. On the other 
hand, /i 3 E 2 <C 1 is satisfied if the HIV population size is of relatively small 
magnitude, namely on the order of 10 6 . Importantly, the HIV population 
size we must consider is the number of active CD4 + cells infected with 
functioning HIV genome. Inactivated CD4 + cells or those infected by non- 
functional HIV do not enter into our model because they do not produce 
offspring infected cells. 
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11300 


84 
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32 


5300 
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54000 


340 



Table 3: CPU Time in seconds needed to generate coalescent probabilities. CPU 
is an Intel-Celeron single node processor. 



Our results have implications for both dynamics and genealogies. For 
dynamics, our arguments show that the stochasticity of (2.12 1 in the 
SPL is completely contained within the pop values described in the re- 
sults section and defined precisely in ( |5.6[ ) and ( |6.2[ ). Intuitively, once a 
variant population reaches large size, averaging effects take over and de- 
terministic dynamics apply. In our nomenclature, a variant population 
is small and hence experiences stochastic dynamics only when it is being 
spawned by another variant population. These spawning dynamics are en- 
coded in the pop values which are stochastic. Taking all this together, if 
we are interested in dynamics and not lineages, then < |2.12[ ) can be reduced 
to a deterministic ODE accompanied by stochastic pop values. 

The stochasticity of the pop values has significant impact on the dy- 
namics of ( |2.12[ ). Figure [8] gives a solution for the deterministic analogue 
of (2.121 in which stochastic events are replaced by their average. An- 
other way to describe such a system is as (2.121 when all pop values are 
equal. Either way, since our equations are symmetric, the dynamics must 
be symmetric and this is indeed the case in Figure [8] All variants within 
the same variant class have identical dynamics. 

In contrast, Figure [9] provides the dynamics for a single realization of 
(2.121. We can see that stochasticity plays an essential role in (2.121 be- 
cause Figure [9] gives very different dynamics than Figure [8] But further, 
our work explains the stochasticity seen in Figure [9] In a given variant 
class, some variants have higher pop values than others. Such variants 
dominate the others in their class. For example in Figure [9] the variant 
Oil dominates 110, 101 when these variants compose most of the popu- 
lation. This dominance results from stochasticity corresponding to a high 
pop value. Biologically, the stochasticity of pop values come from the 
stochasticity of mutation times. 

Turning now to genealogies, we have described the coalescence of lin- 
eages caused by the whole period of HIV escape. However, as mentioned, 
we can decompose HIV escape into time intervals [Ti,Ti+i]. Each such 
period corresponds to Hji,« and so we know the state of the lineages at 
each Ti given the state at Ti+i. Between the Ti, however, our results do 
not describe the lineages. Figures [TTJ] and [TT] show genealogies formed for 
a 5 epitope and 2 epitope attack, respectively, in the case of a linear es- 
cape graph under the SPR. Here we have shown all coalescent events that 
happen during [T», Ti+i] to occur at T. Both genealogy figures were pro- 
duced using Figtree. (Figtree is available as part of the BEAST software 
package [7].) 
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Figure 8 
.1,7 = 3, /T=10 



(2.12) run deterministically for a full escape graph with e 

5 



E = 10 6 

same epitope class evolve identically. 



3, Ak = 

As a consequence of symmetry, all variants in the 



The restriction of our current results to symmetric attack and the small 
end of the HIV population size range is a significant limitation. Further 
work should allow for these restriction to be lifted, but our current results 
provide some general observations. 

For a full escape graph, the assumptions of symmetric attack makes the 
paths through the graph identical in terms of the underlying parameters. 
Removal of the symmetric attack assumption would lead to a dominant 
path. For example, if there is an epitope that is attacked more strongly 
than the other epitopes, then it will be the first epitope at which HIV 
escapes CTL attack. Of course, there will be some HIV variants that 
initially posses a mutation at a different epitope, but these will be few in 
number. The order of the epitopes at which HIV escapes from the CTL 
attack will be specified in the case of asymmetric CTL attack. As a result, 
HIV escape on a full escape graph in the asymmetric attack case should 
proceed essentially on one path of the graph and be similar to the linear 
escape graph dynamics and genealogies we have discussed. 

Our numeric results allow us to compare the form of genealogies for 
large and small HIV populations. Figure [12] compares coalescent proba- 
bilities for linear and full escape graphs. This is the same data presented 
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Figure 9: pl2 | 
H = 1CT 5 , E = 10' 
symmetry of the model is broken by stochastic effects 



run for a full escape graph with e = 3, Afc = .1, 7 = 3, 
Unlike the deterministic case shown in Figure |8j the 



in Figures [5] and [5] In Figure [l2| the four bars give, from left to right, the 
coalescent probability for a linear escape graph under SPR, a full escape 
graph under SPR, a linear escape graph under LPR, and a full escape 
graph under LPR. As can be seen, the coalescent probabilities under the 
SPR are similar for the linear and full escape graphs. Some numerical 
experiments suggest that this is because pop value stochasticity causes a 
single path through the full escape graph to dominate, similarly to our 
earlier comments on asymmetric attack. We don't know why this is not 
the case for the LPR. It may be that pop values take on a different form 
in this regime to which our SPL analysis does not apply. 



5 Linear Escape Graph 

In this section we consider ( |2.I2[ | for the linear escape graph under the 
SPL. Our main aim is to explain and demonstrate Theorem |3.1| For 
notational simplicity, we set Vi = 11 ... 1 00 ... 0. In this subsection we 



write ei for e v , in (2.121. For each variant class Si we define Ti for i 



1, 2, . . . , e as the time at which variant v t reaches scaled population size 
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7.0 



Figure 10: Sampled genealogy for a linear escape graph with 5 epitopes attacked 
under the SPR, that is fi = 1(T 5 ,E = 10 6 . 7 = 3, g = .1, and Afc = .1 as in 
Figure [5j n = 20. The time scale at the bottom is in units of 2 days. 



S, 

Ti = ini{t : e Vi > 8}, (5.1) 

where 

^(u^i(W) 2 (5 - 2) 

The value of 8 can fall within a range of values, the formula above is a 
specific choice within this range. Intuitively, 8 represents a microscopic- 
macroscopic cutoff. Different variants 'interact' in (2.121 through the h 
equation. When a variant has population less than 8, its impact on h 
dynamics and in turn on the dynamics of other variants is small and can 
be ignored in the SPL. From this perspective, the smaller 8 the better. On 
the other hand, a 8 that is too small will make the interval [Ti,T;+i] too 
short in the sense that the Vi+i — > Vi+2 mutations that drive the i + 1th 
spawning period will not have finished by Ti+i. From this perspective, 
the larger 5 the better. Our choice for 8 is a middle ground between these 
two extremes. 

In the SPL, 8 — > 0. Variants with scaled population size less than 8 
collapse as a percentage of the population in the SPL. This is why we 
think of 8 as a microscopic-macroscopic cutoff. However, if a variant has 
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4.0 



Figure 11: Sampled genealogy for a linear escape graph with 2 epitopes attacked 
under the SPR, that is fi = 1(T 5 ,E = 10 6 . 7 = 3, g = .1, and Afc = .1 as in 
Figure [5j n = 20. The time scale at the bottom is in units of 2 days. 



scaled population size 8 then the number of such variants, unsealed, is <5E 
which goes to 00 in the SPL. So while 'microscopic' variants are few as a 
percentage of the population, they may have large population sizes in an 
absolute sense. 
We also set 

To = inf{t : ei(t) = ^ 2 ES 2 }. (5.3) 

To is a special case because we set ei(0) = /iE. 

Using the T; we decompose [0, T aamp i e ] into intervals [Ti_i,Ti] along 
with initial and final intervals [0, To] and [T e , T sam pie] respectively. That 
this composition is valid with probability 1 in the SPL, i.e. 

P(T < Ti < ■ ■ ■ < T e < T samp i c ) ->■ 1, (5.4) 

will be a consequence of our analysis below. 

We consider the interval [Ti, Ij+i] for i = 0, 1, . . . , e — 2. The intervals 
[0, To], [Xe-i, T e ], [T e , T samp ie] are handled separately. We show below that 
during [Ti, Ti+i], only the variants Vi-i, Vi, and Vi+2 play a significant 
role in the dynamics. Table[4]shows the scaled population sizes of different 
variants at T and Tj+i. The arguments that justify Tableware given 
below, for now we focus on intuition. 
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Figure 12: Comparison of the probability of coalescence of two lineages for a full 
escape graph and a linear escape graph with 7 = 3, g = .1, and Afc = .1. The 
bars, from left to right, give the coalescent probability under a linear escape 
graph and SPR, full escape graph and SPR, linear escape graph and LPR, and 
full escape graph and LPR (see Table [2] for the definition of these parameter 
regimes). 



To explain Table gj we first consider the Vi-i and Vi variants. If 



only one variant type exists in whole population, say v, then (2.121 is 
composed solely of the equations for h and e v and in equilibrium we 
have e v ~ (1 — h)/h. Examining Table [4] we see that at T;, is 
roughly at this equilibrium, meaning that it is the dominant variant in 
the HIV population. On the other hand, Vi variants at time Ti are few 
since 8 <C (l — h)/h. However, by time Ti+i, the situation has flipped with 
Vi dominating the population and pushed to low levels. Intuitively, 
the Vi variants are more fit and push out the Vi-i variants during [Ti, Ti+i]. 

Now consider the Vi+i,Vi+2 variants. Recalling that E&e+i(Ti) gives 
the number, unsealed, of v i+1 variants. We first note that 

Ee i+ i(Ti) = 0(/x 2 E 2 <5 2 ) < -. (5.5) 

A 4 

The directly above is justified in the SPL since /x 3 E 2 — > 0. Since the 
probability of mutation is fj,, the inequality above shows that at Ti the 
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variant 


e-(Ti) 


e-(T i+ i) 


Vi-l 


(1 - h{Ti))/h(Ti) + 0(5) 


0(5) 


Vi 


5 


(l-h(T l+1 ))/h(T t+1 ) + 0(6) 


Vi+l 


0(^ z ES z ) 


5 


Vi+2 





0(^EP) 


Vj for j > i + 2 








Vj for j < i — 1 


o(5) 


o(6) 



Table 4: Dynamics of (2.12| during [TjjTj+i] for a linear escape graph. 



rate of Vi+i — > Vi+2 mutations goes to in the SPL. However, notice that 
Ee i+ i (TO w 0(^ 2 E 2 ) — >• co, meaning that e,+i dynamics are deterministic 
at time Tj. 

Turning to we see that at Ti no such variants exist. However, by 
time Ti+i, there are enough such variants to make their dynamics deter- 
ministic. Connecting to our comments in the Results section, [Ti,T i+ j] is 
the i + 1th spawning phase or, slightly more explicitly, the Vi+i — > Vi+2 
spawning phase. 

Finally we note that Table [4] shows that all other variant types are of 
negligible population size. Vj with j > i + 2 have yet to arise and Vj with 
j < i — 1 have been previously driven to low levels by fitter variants. 

Table [4] provides the outlines of an iteration, as we proceed through 
different values of i, that allows us to analyze the stochastic dynamics of 



(2.12 i. The key to deriving the table is an estimate of e;+2(Ti+i), the pop 

Pi+2, (5.6) 



value of v i+ 2- In subsection |5 . 1 1 we show 

e-i+2 



where 

and where S(a, /3, c) is the stable distribution with index a, skewness 
parameter f3, and scale factor c [24] , 

In subsection |5.1| we provide the arguments that justify Tableland 



(5.61. Then in section 5.2 we use the results of section 5.1 which center 
on the dynamics of (2.12 l, to demonstrate our lineage result, Theorem 

mm 

5.1 Dynamics 

The goal of this section is to prove Proposition[l]which is a precise version 
of Tableland jjTgfr . 



Proposition 1. Consider \2.12\ on a linear escape graph. Assume that 
atTi fori = l,...,e-2 

1. P(e j (T i ) = 0) -> 1 for j > i + 2. 

9 e i+1 (Tj) p 



2G 



3. ei{Ti) = S 

I e i -i{T i )= 1 -=$»+0{5). 
5. e 3 {Ti) = 0(6) forj<i-l. 
Then at time T+i we have the following conclusions 
1. P( ei (T i+1 ) = 0)->l/orj>i + 3. 



2. (5.6) holds 



3. e i+ i{T i+1 ) = 6 

4. ei (T <+1 ) = +0(5). 

5. e 3 (T l+1 ) = 0(6) for j < i. 

Conclusion 3 of Proposition [l] holds by the definition of T i+1 . Con- 
clusions 4 and 5 could be phrased in terms of the SPL, for instance 
ej(Ti)/S — > 0, but the o() notation is, to our taste, clearer. 

We can apply Proposition [T] recursively to characterize the dynamics 
at each time T for i — l,...,e — 2. The case i = 0, which we must 
consider to start the recursion, is handled through the same arguments 
that give Proposition]!] except that our assumptions are slightly different. 
Namely, at To we have ei(Tb) = /i 2 E<5 2 by definition of To, which parallels 
assumption 2 of Proposition[I]and eo(To) = (l—h)/h+o(S) which parallels 
assumptions 3. Assumption 1 of Proposition [l] holds for the i — case, 
while assumptions 4 and 5 are not applicable. 

To explain Proposition [I] we s plit [T,T+i] into two time intervals: 



[T;,Tf] and [Tf,T i+1 ]. In Lemma 



5.1 



show that during [T^T/ 1 ] the 



Vi variant displaces the Vi-i variant as the dominant variant, as alluded 
to in Tableland the accompanying discussion. This transition happens 
quickly, so that T/ 1 — T is small. As a result, the Vi+i population does 
not grow in size muc h and no Vi+i — > Vi+2 mutations occur. Through the 



arguments of Lemma 5.2 we show that Vi+i,Vi+2 dynamics on [TJ ; , 



obey the generalized LD dynamics discussed in the Results section. 

Lemma 5.1. Adopt the same assumptions stated in Proposition^ Set 
— Ti + (3/Afc + - p )\ log(5)| where p is given in Ik .l.ty. Then, 

1. forte [Ti,T?], iMe i+ i{t) -> 0, 

2. P{e i+2 (T?) = 0) 1, 

3- e i (T?)= 1 -^tl + 0(S) 
4. e l - 1 (T?)<6. 

Conclusions 1 and 2 of Lemma |5 . 1 1 guarantee that no Vi+i — > Vi+2 mu- 
tations occur during [T^T/ 1 ] and follow from the small population size of 
Vi+i variants at T. Indeed, by assumption we have ei+i(Tj) = 0(/^ 2 E<5 2 ). 
The number of Vi+i variants grow exponentially on [T,T h ]. However, 
T/ 1 — Ti = 0(\ log(5)|) and so despite their exponential growth the num- 
ber of Vi+i variants remains small. More precisely, referring to (2.121, we 
note that h is bounded above by 1 and so we can bound the exponential 
growth rate of e;+i by 7, 

-^-<7e i+ i. (5.8) 
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Integrating the above equation and using the assumption on a+i(Ti) 
gives, 

e.+i(i)<O(^ 2+7 )^0, (5.9) 

where we have used the observation 

f/E « (n 2 E)fiE = v a E 2 -> (5.10) 

to justify convergence to in the SPL. This gives conclusion 1. 

Conclusion 2 follows almost directly from conclusion 1. We recall from 



(2.12l that v i+ i — > v i+ 2 mutations arise at rate fiKrfhei+i. The number 
of such mutations in the interval pi, TJ ] is then a Poisson process with 
mean, 

OijM) J dsh(s)e i+1 (s) (5.11) 

Plugging in our bound from ( |5.9[ ) shows the mean number of mutations 
to be bounded by 0(fi s E 2 ), here we've ignored S factors. Taking the 
SPL gives conclusion 2. 

To explain conclusions 3 and 4 of Lemma [5. 1| we notice that only vari- 
ants Vi and Vi-i are of order greater than S throughout [T^T/ 1 ]. Ignoring 



the other variants then, our ODE (2.121 reduces to three equations in- 



volving ej,ej_i,/j. Since variant Vi has one less epitope exposed to CTL 
attack than Vi-i, it will eventually push the Vi-j to extinction. Initially, 
ei(Ti) = S. Since the CTL kill rate of Vi-i variants is Afc greater than 
those of Vi variants, initially the t>i variants grow exponentially with rate 
Ak + 0(5). It then takes 0(^r| log(<5)|) time for the m population to 
rise to 0(1) levels, push out the Vi-i population, and near equilibrium 
with respect to h. This explains the order of T/ 1 . The exact form of Tj 
is explained in section |A.1| of the appendix as are the technical details 
demonstrating conclusions 3 and 4. 

Now we consider [T^,Ti+i] through the following lemma. 
Lemma 5.2. Assume the conclusions of Lemma \5.1\ Then, 
1. e i+ i(T i+ i) = 5 

m e i + 2(Tj + l) p 

3. ei(T <+1 ) = +0{5). 

4. ej(T l+1 ) = o(S) for j < i. 

Conclusion 1 of Lemma |5.2| follows from the definition of Ti+i. As- 
suming conclusion 1, we see that variants remain at 0(S) levels 
throughout [T 4 , Tj_|_i]. As a result, the approximate equilibrium of e t ,h 
which exists at T/ 1 is maintained. Further, variants Vj for j < i continue 
to drop in number as they are less fit than Vi variants. These observations 
justify Conclusions 3 and 4. 



We have left to consider Conclusion 2. From (2.12 1 we have the fol- 
lowing ODEs for e i+1 ,e i+ 2, 

de l+l = 7 e 1+1 (h- ^±i), (5.12) 
7 

de l+2 = g fdP(jhEe l+2 h) - dP(h +2 Ee l+ 2) + dP{fiyhEe l+1 )\ . 
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Note that ej+i is given by a deterministic ODE because Eej+i(Tj) — > oo. 
By conclusion 3, since e;, h are near equilibrium, we know h = ki/ r y+0(5). 
Plugging this result into ( |5.12[ ) gives, 

de i+ i = (Ak + 0(5))e l+1 , (5.13) 
de l+2 = g fdP((fo + 0(5))Ee i+2 ) - dP(h +2 Ee t+2 ) + dP(n(h + 0(6))®* 

If we label the number of Vi+2 variants as ef +2 , by our scaling ef +2 = 
Eei-1-2, the ei+2 equation in ( |5.13| l transforms into, 

deT +2 = dP((ki + 0(6))ef +2 ) - dP(k l+2 ef +2 ) + dP{n(kn + 0(5))Ee i+1 ), 

(5.14) 

and we find that Vi+2 variants evolve according to a binary branching 
process with birth rate ki + 0(5), death rate fci+2, and mutation rate that 
creates new v i+ 2 variants fj,(ki + 0(5))Eej+i. 

Ignoring the 0(S) term in the rates, the growth rate for Vi+2 variants, 
which we label as Ti+2, is = ki ~ ki+2 = 2Afc. Considering Vi+i — > 
Vi+2 mutations, we have for the rate 

rate Vi+i — > Vi+2 at time t m fikiEei+i(t) (5.15) 

= fikiEe.+iiT?) exp[Afc(t - T?)]. 

By the assumptions of Lemma |5.2| there are no v i+ 2 variants at time 
T, h . The Vi+2 population arises from mutations in the Vi+i population 
which expands at rate Ak. Such mutations produce Vi+2 cells that then 
expand at rate 2Afc, precisely the generalized LD dynamics mentioned in 
the Results section. 

We define LD c i ass ic(i) to be the number of mutants at time t for the 
LD model in which mutants and wild types grow at the same rate. In 
[161 123] the following asymptotic formula was derived under the further 
assumptions that wild types grow deterministically and mutants grow 
stochastically, but with no death events. 

TV 

LD c i assic (t) « mlog(m) + mS(l, 1, -) (5.16) 

where m is the expected number of mutations on the time interval [0, t] 
for a wild type population that is of size 1 at time 0. The relative error 
of ( |5.16[ ) goes to as m — > oo. 

In contrast to LD c i ass i c (t), we let LD2(t) be the number of mutants at 
time t for the generalized LD model in which mutants grow at double the 
rate of wild types. As in the case of LD c i aas i c , we will assume that wild 
types grow deterministically, matching the deterministic growth of 
variants as they spawn. However, for LD2 we will assume that mutants 
have non-zero birth and death rates, corresponding to the situation for 
Vi+2 variants. In section [A.2| of the appendix, using generalizations of the 
techniques found in [23] . we show 

U>2(t)^m 2 S( l -,l,-K 2 (5-17) 
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As for (5.161, the relative error of (5.171 goes to as m — > oo. 

ef +2 {t) is approximately an LD2(t) process on [T/^Tj+i] and becomes 
exactly so in the SPL. In this setting, m is the expected num ber of 
Vi+i — > Vi+2 mutations during [Ti,Ti+i]. We show in sectio n 
appendix, m ss (^)/iE. Plugging this value of m into (5.171, we find 



of the 



.iH^V^^ff ))• '■'■ isi 



Recalling that ei+2(£) = e|^_ 2 (t)/E gives conclusion 2 of Lemma 5.2 



5.2 Lineage Construction 

To demonstrate Theorem |3.1| we need some additional lineage notation. 
Recall that we consider n lineages, £j for j = 1, 2, . . . , n, corresponding to 
the n sampled cells. We let be the ancestral cell at time t of sample 
cell j. (To make this precise we could number the cells in our process 
as they are born, and then £j(t) would map to N, but we will not make 
this explicit.) We let V{lj{t)) be the variant type of £j(t). For example, 
V(£(T samp i c )) = 11. . . 1. We write £,-, dropping the time dependence, 
when we are considering the lineage over a range of times. 

To combine the separate lineages into a genealogy, we need to identity 
mutation and coalescent events. A mutation event on tj occurs at time t 
if the variant of £j changes at time t, more precisely V(£j(t—)) ^ V(£j(t)). 
Given two lineages £j,£k, the lineages have coalesced by time t if lj{t) = 
£k(t) and we say that the lineages coalesced at time t if for t' > t, I jit!) 7^ 
£ k (t'). 

We prove Theorem |3.1| by considering mutation and coalescent events 



on the interval [Tj, Tj+i]. Lemma 5.3 sets up this analysis by showing that 



no mutation or coalescent events occur on [T e _i, Trample]- This allows us 
to consider the interval [T B _2,T e -i] with all lineages of type v e at T e -i. 

Lemma |5.4| provides two results for the general setting of an interval 
[Ti,T i+1 ], assuming that all lineages are of type v i+ 2 at T i+1 . First, by 
time Ti, all lineages are of type Vi+i. This result implies that each lineage 
must experience a Vi+i — > v i+ 2 mutation during [Ti,T i+ i]. Second, two 
lineages, £j , £^ , coalesce during [Ti , Ij+i] if and only if their associated cells 
at Ti+i, £j(T i+ i), £ k {T i+ i), descend from the same Vi+i — > v i+ 2 mutation. 
In other words, the lineages coalesce if there is a cell that is of variant type 
t>i+i which produces a child cell of type v i+ 2 from which £j(T i+ i), £k(T i+1 ) 
are both descended. Lemma |5 ,4| reduces the analysis of coalescent events 
on [Ti,Ti+i] to the analysis of mutation events and the number of their 
descendants at T i+ \. 

( |5.18 \ gives an asymptotic description for the number of descendants 



at Ti+i produced by all — > Vi+2 mutations during [Ti, Ti+i]. In other 
words, the pop value of v i+ 2- Lemma |5.5| describes II(Ti) given H(T i+ i) 
by considering each such mutation separately. To do this, we decompose 



(5.181 into a collection of single mutation results as described in (3.161 



Through this decomposition, by exploiting Lemma |5.4[ we characterize 



coalescent events on [Ti,Ti+i]. By repeatedly applying Lemmas 5.4 and 
5.5 we can characterize coalescent events on [0, T samp i c ]. 
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Lemma 5.3. Fort > T c _i and j,k — 1,2, ...,n, V(£j(t)) = v e (no 
mutation events occur) and £j(t) 7^ Ikit) if j ^ k (no coalescent events 
occur). 

Proof. Consider the probability that the lj lineage experiences a mutation 
at time t > T e _i. For such an event to occur, a t> e -i — > v e mutation must 
occur and the resultant v e variant must be in the lj lineage. The rate of 
v e — j — > v e mutations is given by ii/yRhe e -i(t) which is trivially bounded 
by 0(/iE). By symmetry the v e variant resulting from a mutation is in lj 
with probabiliy 0( Ee 1 ^ ). By Proposition |TJ for t > T e -i this probability 

is bounded above by 0( —^grgi )■ From this we have, 

P(no mutation event on [7e-i, Tsample]) (5.19) 

Temple I 



In the last line above we have used the result r samp i c — T e _i = 0{8). To 
see this note that after T e _i, the v e variants expand deterministically. 
Arguments similar to those used in Proposition [I] show that v e will push 
v e -i to 0(8) levels in 0(8) time. 

The argument for no coalescent events is similar. For lj,lk to coa- 
lesce at time t, a v e variant must give birth to a new v e child cell, which 
occurs with rate 0(Ee e (i)) and £j(t),£k(t) must be, in no particular or- 
der, precisely these parent and child cells, which occurs with probability 
0(l/(Ee e (t)) 2 ). This leads to, 

P(no coalescent event on [T e _2, Tsampie]) (5.20) 

/'^sample j 



exp[-0( 



S 



2 K 2 S 2 ' 



/'' 



□ 



As mentioned, Lemma |5.3| allows us to consider [Tj,Ti+i] under the 
assumption that all lineages are of type Vi+2 at Ti+i. With this in mind, 
we introduce the following definitions. 

Tj =inf{t: V{£j(t)) = v i+ 2}, (5.21) 

Tj is the time of the Vi+i — > Vi+2 mutation event on £j and aj is the specific 
infected cell, a Vi+i variant, that produces the cell Ijijj), a Vi+2 variant. 
We use aj as a mnemonic for 'ancestor' since aj will be the ancestor of 
<i(T<+i)- 

Lemma 5.4. Let j, k £ {1, 2, . . . , n} and assume V(£j(Ti+i)) = Vi+2 for 
all j. Then Tj G [Ti,Ti+i] (a Vi+i — > Vi+2 mutation occurs on [Ti,Ti+i]) 
and V(£j(Ti)) = Wj+i. Further for j / k, lj(Ti) = £k(Ti) (a coalescent 
event has occurred) if and only if aj — au- 
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Proof. Proposition [T] shows that no Vi+2 variants exist at time Ti, so 
V(£j(Ti)) / v i+ 2. This immediately implies that Tj £ [Ti,T i+ i]. For 
t 6 [Ti,Tj], essentially the same arguments that gave Lemma |5.3| show 
that no Vi+i variant lineages experience mutation events prior to Ti. Con- 
sequently, we can conclude V(£j(Ti)) = Vi+i- 

Now we consider coalescent events. If Tj — Tk then we have aj = as 
required by the lemma. So now assume Tj > Tk, we want to show that the 
two lineages do not coalesce prior to Tj. Since Tj ^ Tk, tj and £k cannot 
coalesce during [tj,T;+i], otherwise we would necessarily have Tj = Tk- 
Further since ij and 1% are of different variant type during (rk,Tj\, no 
coalescent event occurs on (Tfc,7j]. On the interval [Ti,Tk), £k,£j are of 
type Vi+i and the same arguments that gave Lemma |5.3| show that Vi+i 
variants do not coalesce prior to Ti. We are left with the possibility of 
a coalescent event at time Tk- This would mean that £j(rk— ), which 
is of type Vi+i, produces a mutant child cell that is of type Vi+2 which 
is precisely £k{Tk)- However, the mutation event associated with £k(rk) 
is equally likely to be produced by any variant Vi+i at time Tk- The 
probability that £j(Tk~) is the cell chosen is ft which is bounded 

as follows, 

Ee l+1 (r fc ) < Ee l+1 (rO = 0( M 2 E 2 <5 2 ) ~* °' (5 ' 22) 

□ 

The random partition ~B.a,% characterizes the coalescent events that 



occur on [Ti, Zi+i]. From Lemma 5.4 we know that coalescent events are 



associated with ^+1 — > Vi+2 mutations during [T^Ti+i]. Mutations that 
occur relatively early in this time interval will, on average, produce more 
descendants at H+i. Consequently a lineage, £j, is more likely to descend 
from a mutation that occurs relatively early. The parameter A considers 
mutations that happen on the interval [T^Ta] where Ta is defined as 
the time at which the mean number of Ui+i — > Vi+2 mutations that are 
expected to occur is precisely A. Lemma |5.5| shows that the error in the 
approximation ~E.a,% collapses as A — > 00. 

Lemma 5.5. Let A be a fixed partition ofU(Ti+i) and A a fixed constant. 
Then, 

limP(n(T0 = A) = P(E A ,i(U(T i+1 )) = A) + O(i). (5.23) 

Proof. Let m be the mean number of Vi+i — > Vi+2 mutations on [Ti, Ti+i] 
and K a \\ a Poisson r.v. with mean m. Then there are K^w such mutations. 
Numbering the mutations in some arbitrary way, we let ef^}^ (Ti+i) be 
the number, unsealed, of ancestors at Ti+i that descend from mutation q. 
The total Vi+2 population at Ti+i is then given by, 

Ee I+2 (T l+1 ) = Y, 4+2 q \T l+1 ) (5.24) 

9 = 1 

We split the interval [Ti,T i+1 ] into the intervals [Ti,T A ] and [T A ,T i+ i]. 
We let m crror be the mean number of mutations in [Ta,T{+i]. As men- 
tioned, Ta is chosen so that the mean number of mutations in [Ti,TU] 
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is A. Then, we have m = A + m crror . Let Ma and M crr0 r be the set 
of mutations that occur on [T^Ta], [Ta,T;+:l] respectively. Then (5.241 
becomes 



,#,(<?') 

-i + 2 



(Ti-\ 



In section |A.3| of the appendix we show, 



„#.(«') 



o(±), 



and (5.251 reduces to 



Ee l+2 (T l+ i) 



q£M A 



m+i). 



(5.25) 



(5.26) 



(5.27) 



A simple computation shows that the number of M CI toi mutations is 
0([iE) while the number of Ma mutations is almost by definition O(A). 
However, Ma mutations happen early allowing their descendant popula- 
tion to expand at rate 2Ak for a longer time than Af orror mutations. On 
the other hand, the lateness of Mermr mutations means there will be more 
such mutations because the v i+ i population expands at rate Ak. Since 
the Vi+i population expands at half the rate of the Vi+2 population, the 
descendants of early mutations dominate. 

From | |5.27[ ), in the SPL each lineage cell £j(T i+ i) must descend from 
one of the Ma mutations. Notice that K in Definition [l] is precisely the 
number of Ma mutations. By Lemma 5.4 lj,lk coalesce during [Ti,Tj+i] 
if and only if they descend from the same mutation. Lineage j descends 
from mutation q with probability, 



„#.W frp \ 



„#.(?') 
1 e i+2 



(Ti- 



(5.28) 



In section 
show 



3] of the appendix through branching process asymptotics we 



o 1+2 (T l+1 )->CT W (5.29) 
constant that is independent of q, and is as defined in 



where C is 

the Results section just prior to Definition[T] Combining (5.27 1 and ( 5.29 \ 
gives the pop value decomposition (3.161 stated in the Results section. 
The constant C cancels out in the ratio ( 5.28[ ) and the resultant formula 
is precisely the 'coloring' probability given in Definition [T] Putting all 
these observations together proves the lemma. □ 

As mentioned in the Results section, we do not sample pop values even 
though (5.61 would allow it. Crucially, if we sampled pop values then to 
construct lineages we would need to determine ef^^ (Ti+i) conditioned 
on the pop value ei+2(Ii+i) . We do not know how to sample from this 
conditional dist ributi on, so instead we sample which is proportional to 



; i+2 (T 4+ i) (by ( 5.29 1) and thereby implicitly sample e i+ 2(T i+ i) through 
the decomposition ( |3.16 l. Finally we note that the C in (5.291 is difficult 
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to compute as it depends on h. Fortunately, lineage construction can 
proceed without knowing C. 

Lemmas |5 , 3||5 , 5| demonstrate Theorem |3.1| Starting at T samp i e , Lemma 
|5.3| shows that no coalescent events happen back to T e -i. The on each 
[Ti,Ti+i] for i — 0, 1, ...,e — 2 the coalescent events are described by 
Concatenating these Ha.i takes us back to To. By our assumption 
on initial conditions, a lemma very similar to |5.3| can be proved showing 
that no coalescent events occur on [0, To]. 



6 Full Escape Graph 

In this section we generalize the arguments used in Section [5] for the 
linear escape graph to the full escape graph. As we did for the linear 
escape graph, we divide the dynamics on the full escape graph into time 
intervals [T,T+i]. For the linear escape graph, the ith variant class, £i, 
is composed of a single variant Vi and the T are defined by e;(T) = S 
in ( |5.1[ ). To generalize this definition to the full escape graph, we let T, 
be the first time any of the variant populations in class i reaches a scaled 
population size 5: 

Ti = M{t : 3u G Si such that e v (t) > S}. (6.1) 

Sections |6. 1| and |6.2| generalize the results for the linear escape graph 
to the full escape graph. The arguments are similar, so we emphasize the 
novel ideas that apply to the full escape graph case. As in section [5j we 
base our analysis on consideration of an interval [Ti,Tj+i]. 



6.1 Dynamics 

During [T, Tj+i], for the linear escape graph, only Vi+i variants spawn 
Vi+2 variants. Recalling that V(v) is the set of variant types that can 
mutate into variant v £ £i+2, for the full escape graph all v' £ V(v) 
spawn v variants. For example 1100 can be spawned by 1000 or 0100. 



Consequently, the pop value result, (5.6 1, must be generalized as follows, 

e„(T i+ i) 



P v (6.2) 



where 



and 



v'ev(v) 



2 / a , \ 2 



Afc\ „,1 _ 2 fAk 



V 



where 



p +1 j \k i 5( 2- 1 '" It- } - (6 - 4) 

J max, i + l / \ rv% / ^ \ % / 

Pmax.j+i = max P v ». (6.5) 

v"es i+1 

With P v generalized from ( |5.7| l to ( |6.4[ ), a version of Proposition [l] gen- 
eralized in a similar way holds for the full escape graph. Namely, during 
[T,Ti+i] variants in £i sweep to dominance, pushing the Ei-\ to 0(8) 
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levels. Concurrently, £i+i variants rise to 0(8) levels while spawning £;+2 
variants. Spawning events occur for every v £ £i+2, v' £ V(v) pair and 
conclusion 2 of PropositionfTlis generalized to included all such P v <^ v pop 



values as described in ( 6.2 l-( |6.4[ ) 

Notice that the difference between ( 6.4 1 and (5.7l is the factor (P v > / Pmax,i+i) 2 



in ( |6.4[ ). As this difference is the main technical novelty in moving from 
a linear to a full escape graph, we focus on its derivation. 

At time Ti, all variants v' £ £i+i have e v '(Ti) « P v >fiE 2 5 2 . Conse- 
quently, since Ee„/(Tj) — > oo the e„/ dynamics are deterministic in the 



SPL and we have, 



e v '(t) 



^ p ~P.>IAt) (6.6) 

where t 

I v ,(t) = exp[J ds{jh(s) - k l+1 ]. (6.7) 

Notice that I v /(i) is dependent only on the variant class £i+i through 
fej-l-i but not on the specific variant v in £i+\. This leads to the ratio for 

v',v" £ £;+!, 

- (6-8) 

Recall that Tj+i is defined as the first time for which some variant 
v £ £i+i is of scaled population size 8. Let « m ax,i+i be that variant. 
Then, by definition 

e. maK , i+1 (T i+1 ) = <5, (6.9) 
Further, from (|6.6|) we know that P %Bj+1 = P max ,i+i. Plugging the 



equation directly above into (6.8 1 with v" = w m ax,i+i gives 



e v ,(T i+1 )= p^— 5. (6.10) 

\ Jmax,i+1 / 

The v' population size at Ti is too small to create v' — > v mutations 
(this is the content of conclusions 1 and 2 in Lemma JfTTJ. Consequently, 
if we want to know how many v variants are produced at Ti+i by v' — > v 
mutations and their descendants, all we have to know is e„'(T,:+i). In- 
deed, the arguments of section [5] show that when e,/ (Ti+i) = 3, then 
e„(T;+i) = fj, 2 ¥,5 2 S. But for the full escape graph, e„/(Ti+i) is given by 



(6.101 and consequently we must replace 8 by (P v i / P ma x,,%+i)& in (5.71. 



This substitution gives (|6.4 



6.2 Lineage Construction 

Lemmas |5.3| and |5.4| proved for the linear escape graph apply to the full 
escape graph with almost identical proofs. Lemma [5.5[ on the other hand, 
requires significant generalization. 

In the case of the linear escape graph, given an A we defined Ta on 
the [T<, Jj+i] interval as the time at which the number of Vi+i — > Vi+2 
mutations has mean A. More precisely, Ta was defined by, 

ds(njhE)e Vi+1 (s) = A. (6.11) 
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For the full escape graph, we generalize the definition of Ta by considering 
the variant « m ax,i+i (defined in the previous subsection): 

ds(/ry/iE)e„ max l+1 (s) = A. (6.12) 



Using (6.8l with Ti+i replaced by Ta we find that the mean number of 
v — > v mutations produced by Ta is A{P v i /P(« m ax,i+i))- In other words, 
each v' has an associated scaled version of A. 

For the linear escape graph, we sampled K ~ Poisson(y4) versions of 
and then applied the paintbox construction given by Definition]!] For 
the full escape graph, the idea is similar except that for each v' such that 
v — s> v mutations are possible we must take Poisson( J 4(P t ,//P(u maXj i+i)) 
samples of r' ! '. Then, we must assign different 'colors' for each such 
sample for each possible v' . 

Everything then proceeds according to a paintbox partition, except 
that the colors also tell us which variant in the £i+i class produced the 
lineage being colored. So, as described by Definition [3] we must not only 
coalesce lineages according to the paintbox construction, we must also 
allocate the lineages to the appropriate variants at time Ti. 

To implement all this, we estimate pop values through the decompo- 



sition D v given in (3.211. As we mentioned directly below the proof of 



Lemma 5.5 we do not sample pop values directly using ( 6.2 I because then 
we would need to consider conditional distributions from which we do not 
know how to sample. Notice that K v >^ v depends only on the ratio of pop 
values, and so we may use the D v to form K v i^, v . Since we take all £1 
variants to be of equal population size at t = 0, we may take D v — 1 for 
v G Ei to start the iteration. Once D v and the F^' have been sampled 
according to Definition [2] we move backwards from T samp i e to To and im- 
plement the paintbox construction of the H^.i, Definition [3j in order to 
coalesce lineages. 



A Appendix 

A.l Proof of Lemma 15.11 

In this section, we provide the technical details that support conclusions 
3 and 4 of Lemma |5.1| F rom conclusions 1 and 2, we know ei,e i+l <C 5 



and we can reduce (2.121 to the following 



^ = g (1 - h - h(e, + ei _! + 0(6))) (A.l.l) 

% i =7C i -i(ft-— ) + 0(m). 
at 7 

The O(p) terms in the last two equations directly above can be ignored 
because T/ 1 — Ti~ 0(\ log(<5)|) and fx\ log(S)\ —¥ 0. Dropping these terms, 



3G 



we note the following relation, 
d(log(e i 



log(ei_i)) 



dt 



Ak. 



(A.1.2) 



Integrating the above equation and using our assumptions on ej_i(Tj) and 
ei(Ti) we find, 

= 0(5exp[Afc(i - Ti)]). (A.1.3) 

ei-i(t) 



If we can show that ei is bounded then as t grows, (A.1.3 1 implies that 
collapses. To see that ej is bounded, set 



9 7 7 



Then by straightforward differentiation, 
l-h-h- 0(6) - 



dz 
dt 



ki k%—\ 
e, 6i_i- 



< 1 



7 7 
min(p(l - o(<5)), fci, fti-i)z 



(A.1.4) 



(A.1.5) 



Since is non-negative, we find that z(t) must be bounded. In turn 
h, a and d-i must be bounded. Returning to (A. 1.3| and setting t > 
2/Afc log(5)| we find, since ei is bounded, 



e t -i(t) = 0(S). 



Once < > 2/Ak\ log(<5)|, we can further reduce (A. 1.1 1 to, 



dh 
~dt 



g (l-h-h(ei + 0(5))) 

kj , 
7 



(A.1.6) 



(A.1.7) 



de, 



Consider then ( A.1.7 1. Ignoring the 0(S) term for a moment, the system is 
not dependent on S. Since we have shown h, ei to be bounded, application 
of Poincare-Bendixon shows that the system converges to its non-trivial 
equilibirum, fo = -± and a = ^j^- Now consider the 0(S) term. Given 
some fixed distance e > 0, if we run the system from t = 2/Afe|log(5)| 
to t = 3/Afc| log(<5)j we are guaranteed by choosing S sufficiently small to 
be within e of the equilibrium. In turn, taking e small, we can linearize 
(A.1.71 about its equilibrium. 

Straightforward computation shows that both eigenvalues of the lin- 
earized system have negative real part bounded above by, 

dk 2 k 

p = -(gj + 0(5)) min(l, ^-(1 - *)) (A.1.8) 

91 7 

Running the system from t = 3/Afc| log(<5)| to t = (3 + A)| log(<5)| forces 
( |A.1.7[ ) to within 0(5) of the equilibrium. 
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A.2 Proof of ([57181) 



In this section we provide technical details that justify (5.181. (5.181 
is what underlies our pop value formulas. The arguments we employ 
are extensions of those found in |23j which considered the LD c i aaa i c (t) 
process. To make this connection more explicit, where possible we adopt 
the notation of |23j . 

Following from the arguments made directly above (5.151, we assume 
the following: 

1. the Vi+i population expands deterministically from time T/ 1 to Tj+i 
with rate Ak + 0(8) 

2. at time T/ 1 no Vi+2 variants exist 

3. Vi+i — > Vi+2 mutations occur at rate fikiEei+\. 

4. Vi+2 variants have birth and death rates b = ki + O(S) and d — 
k l+2 + 0(8). 

We set r = b — d and by our assumptions on CTL attack, r = 2Ak + 0(5). 
We will develop our formulas for arbitrary b, d assuming only r > Ak. 

As observed in [23], the number of mutations is Poisson distributed 
with mean m given by, 

J +1 dsfj,kiEei +1 (Tt)exp[(Ak + 0(6)(s-T i+ i))] (A.2.9) 

By definition e;+i(T;+i) = 8. We can then estimate the integral expression 
for m and find, 

m = (|^) M E5exp[0(5)(r l+1 - if)]. (A.2.10) 

We want to show that exp [0(<5)(Ti+i-7f )] -> 1. Note the following three 
facts, 

• e i+ i(T i+ i) — 8 

• e i+1 (T?) > e i+ i(Ti) = 0(^ 2 E8 2 ) 

• On [T^Ti], e i+ i grows deterministically with rate Ak + O(S), 

In other words, we know where ei+i starts and ends, and we know its 
growth rate. Then, we can find the bound 

^-^ w^w - (A - 2 - n) 

And so by our definition of 8, ( |5.2| , we can conclude exp[0(8)(T i+ i — 
Ti)] — > 1. Plugging this observation into (A.2.10 1 gives, 

-r^ > 1. (A.2.12) 

A similar argument will show that the 0(8) expressions in the formulas 
for b, d and mutation rate will have no effect in the SPL. For simplicity 
then, we drop O(S) terms from this point on without comment. 
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Connecting to the notation of [23], we let Y m = ef +2 (Ti+i), the number 
of Vi+2 variants at time Tj+i. We can decompose Y m by 



(A.2.13) 



where K is Poisson distributed with mean m, given in ( |A.2,10[ ), and each 
Xk represents the number of Vi+2 variants at time T;+i that descend from 
a single Vi+2 variant produced by a Vi+i — > Vi+2 mutation (compare (2.1) 
in [23]). Conditioned on K, the times of the K mutations are iid. and in 
turn, the Xk are iid. 

The following lemma characterizes the Laplace transform of Xk ■ 

Lemma A.l. Let the Xk be iid versions of the r.v. X . Recall r — b — d. 
Then 



lim 

A->0 



E[exp[-XX]] - 1 



where 

T(6,d, Afc) 



Ak 



Ak 



= -T(fe,d, Afc) 



<K5 



CSC(7T 



Ak, 



(A.2.14) 



(A.2.15) 



Before proving Lemma |A.l[ we state and demonstrate Proposition |A.l| 
which characterizes the distribution of Y m for large m. ( 5.17 1 follows from 



Proposition A.l by setting r = 2Ak. (5.181 follows from the proposition 



by further setting m = (ki/Ak)fiR8 as justified by (A. 2. 12 1 
Proposition A.l. 

Y m „,Ak 



lim 



■S{ — ,l,T(r,Afc)) 



(A.2.16) 



To see that Proposition | A . 1 1 follows from Lemma [A. 1| first notice that 
by (A.2.13 1 and the independence of the Xk we have 



E[-\Y m ] = exp[m(E[exp[-\X]] - 1)]. (A.2.17) 

Replacing A by X/m r ' Ak in the above equality and applying Lemma A.l 
gives, 



lim E[-\-^] = exp[-A^T(M,Afc)l. 

i— J-oo 



(A.2.18) 



Stable distributions are characterized by index a, the skewness parameter 
/3, and the scale factor c [21] ■ The limit above is seen as the Laplace 
transform of a a — — stable distribution. Since the support of Y m is 
on [0,oo], /3 = 1 [24]. c is defined through the characteristic function of 
the stable process. To determine c we invert the characteristic function 
of S(Ak/r, l,c) and compute its Laplace transform. Setting this result 



equal to the right side of (A.2.18 1 we find 



c= T(b,d,Ak) 



1 + COs(7Tq) 

cos(^) 



(A.2.19) 
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In the case a = 1/2 we find 



c = 2T 2 {b,d,Ak) (A.2.20) 
If we plug in our values for b, d in the formula for T we find 



and so we arrive at 

'Aft 



Finally, we give the proof of Lemma [A. 1| 



(A.2.22) 



Proof of Lemma \A.l\ Let t* be the random time of a single mutation. A 
standard Poisson process argument applied to ( |A.2.9[ ) gives for the density 
of t*, 

P(t* =s) = Afcexp[-Afc(T i+ i - a)], (A.2.23) 

Let Z(t) be a branching process with birth and death rates b, d respectively 
run to time t assuming Z(0) = 1. Then X — Z(t*) and we can represent 
E[exp[-\X\] - 1 by, 

j-Ti+i 

E[exp[-\X]]-1 = / dsAkexp[-Ak{T i+l -s)]G(T i+1 -s), (A.2.24) 
where 

G(t) = exp[-\Z(t)] - 1. (A.2.25) 

Extending the range of integration from T/ 1 down to — oo also adds 
error that collapses in the SPL. With this observation and the change of 
variable s — > Ti+i — s we consider, 

E[exp[-\X]]-1= dsAkexp[-Aks]G{s) (A.2.26) 



To understand (A.2.26 1, we have found it useful to make the substi- 
tution w = Aexp[rs]. To explain why, we observe the well known fact, 
E[Z(t)] = exp[rt]. We expect then, 

G(t) w exp[-Aexp[ri]] - 1 = exp[-w] - 1. (A.2.27) 



As a result, at least for us, analyzing (A.2.26 1 in terms of w simplifies the 
contribution of the G(t) term to (A.2.26 1. Making the substitution gives, 

Ak At f°° 1 
E[exp[-XX]] - 1 = —A- / dw—^G(f(w)) (A.2.28) 

where f(w) = - log(y). The integral to the right of the equality directly 
above has a limit as A — s> 0. This is not obvious due to the singularity of 
— ^ at w — 0. To remove the singularity and show that the integral 
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is indeed O(l) we apply partial integration twice, integrating the 
term and differentiating the G(f) term. We find, 



E[exp[-\X] 



where 



Afc 

A~ 



Afc 



h 



(: 



-u;-— G(/H) 



L )[G(f(w))]" + I 1+ I 2 
(A.2.29) 

(A.2.30) 



Afc 



)w 1 ~[G{f(w))]' 



G can be characterized through a Kolmogorov equation. In our specific 
case 

(A.2.31) 



G' = bG(G+ {1 -=■)). 





(A.2.31 1 can be integrated to find a specific formula for G and using 
E[eacp[-\X]] - 1 



(A.2.29 1 we find 



lim 



A^ 



T(b,d, Afc), 



(A.2.32) 

□ 



A. 3 Proof of Lineage Construction Approxima- 
tion 

In this subsection, we demonstrate the two assertions used to prove Lemma 
which justifies our paintbox construction approximation of 11(0). 



5.5 



Lemma A. 2. 



q'eM c 



q£M A 



Ee^(T i+1 ) i 



< 1 (T.+i) 



(A.3.33) 



Proof. If we sum the numerator and denominator of (A.3.33 1, we have 
Y m which was intro duced in the previous section, see (A. 2. 13 1. We know 
through Proposition A.l that Y m /m 2 converges to a stable distribution. 
We now apply the arguments of Proposition | A . 1 1 separately to the numer- 
ator and denominator. The difference in the arguments will be that the 
bounds of integration in (A. 2. 28 1 will no longer be A and oo. However the 
same essential ideas are used and we find, 



Ee&CZWx) 

qeM A 



Afc 



(5 M E) 2 S(-,l,7r 2 



while 



J2 Ve%l{T i+1 )nO{{5^f)M{~,± z , 



ij'G-Morpo 



(A.3.34) 



(A.3.35) 
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where A/"(/i, a 2 ) is a normal distribution with mean fi and variance a 2 . No- 
tice that the early mutations produce a heavy tailed distribution, while 
the later mutations are normally distributed. By time Ta, the rate of mu- 
tation is O(A), as a result many mutations happen at approximately the 
same time and through an appropriate scaling these nearly simultaneous 
mutations lead to a central limit theorem. Taking the ratio of ( |A.3.34[ ) 
and ([A.3.35J gives (|A.3.33[) □ 



Lemma A. 3. 



where the T q are sampled from T^K 



(A.3.36) 



Proof. Let t* be the random time of a mutation on [T/ 1 , Ta] ■ Up to an 
error that disappears in the limit, we have 

P(T A -t* = s) = Afeexp[-Afcs]. (A.3.37) 

In other words Ta — t* is exponentially distributed with rate AA: and 
setting 

£ 9 , 2 = Ak(T A - f) (A.3.38) 
defines £ 9i 2 as a exponential r.v. with rate 1. When a v i+1 — > v i+ 2 mu- 
tation occurs at time t* , the number of descendants at Tj+i, Ee]i*) 2 (Ti+i) 
is given by Z(T+i — t*) (recall the definition of Z(t) from section A. 2 1. 
Since Ti+i — Ta — >■ oo and t* < Ta, we have Ti+i — t* — > oo. Standard 
asymptotic results from branching process theory (see [T]) give, 

k 9 Ah 

exp[-2Afc(T, - t*)]Z(Ti - f) -> ^^ B (^)> (A. 3. 39) 

where is exponential with rate 1 and B(p) is a Bernoulli r. v. with 
success probability p. We can apply this analysis to each Ee^+ 2 m ( A.3.36 1 
by multiplying the numerator and denominator by exp[— 2Ak(T — Ta)\ 
to find, 

Eeff 2 (T i+1 ) ; e X p[^k(T A -t*)]£- k UiB q ( 2 -^) 

J2 q eM A MH(T l+1 ) ~* ^ q , eMA exp[2Afc(TA - M*^) 

(A.3.40) 

exp[2C 9 , 2 ]^,iS 9 (^) 



J2 q , eMA eM^ g '^ q 'AB q ,( 2 -^) 

□ 
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